Voice activity detection based on fusion of audio and visual information
نویسندگان
چکیده
In this paper, we propose a multi-modal voice activity detection system (VAD) that uses audio and visual information. Audioonly VAD systems typically are not robust to (acoustic) noise. Incorporating visual information, for example information extracted from mouth images, can improve the robustness since the visual information is not affected by the acoustic noise. In multi-modal (speech) signal processing, there are two methods for fusing the audio and the visual information: concatenating the audio and visual features, and employing audio-only and visual-only classifiers, then fusing the unimodal decisions. We investigate the effectiveness of these methods and also compare model-based and model-free methods for VAD. Experimental results show feature fusion methods to generally be more effective, and decision fusion methods generally perform better using model-free methods.
منابع مشابه
A New Algorithm for Voice Activity Detection Based on Wavelet Packets (RESEARCH NOTE)
Speech constitutes much of the communicated information; most other perceived audio signals do not carry nearly as much information. Indeed, much of the non-speech signals maybe classified as ‘noise’ in human communication. The process of separating conversational speech and noise is termed voice activity detection (VAD). This paper describes a new approach to VAD which is based on the Wavelet ...
متن کاملDecision fusion by boosting method for multi-modal voice activity detection
In this paper, we propose a multi-modal voice activity detection system (VAD) that uses audio and visual information. In multi-modal (speech) signal processing, there are two methods for fusing the audio and the visual information: concatenating the audio and visual features, and employing audioonly and visual-only classifiers, then fusing the unimodal decisions. We investigate the effectivenes...
متن کاملA robust audio-visual speech recognition using audio-visual voice activity detection
This paper proposes a novel speech recognition method combining Audio-Visual Voice Activity Detection (AVVAD) and Audio-Visual Automatic Speech Recognition (AVASR). AVASR has been developed to enhance the robustness of ASR in noisy environments, using visual information in addition to acoustic features. Similarly, AVVAD increases the precision of VAD in noisy conditions, which detects presence ...
متن کاملAudiovisual speech source separation: a regularization method based on visual voice activity detection
Audio-visual speech source separation consists in mixing visual speech processing techniques (e.g. lip parameters tracking) with source separation methods to improve and/or simplify the extraction of a speech signal from a mixture of acoustic signals. In this paper, we present a new approach to this problem: visual information is used here as a voice activity detector (VAD). Results show that, ...
متن کاملTwo-layered audio-visual integration in voice activity detection and automatic speech recognition for robots
Automatic Speech Recognition (ASR) which plays an important role in human-robot interaction should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas to improve the robustness in such environments. This paper proposes two-layered AV integration for ASR which applies AV integration to Voice Activity Detection (VAD) and...
متن کامل